Code Mixed Cross Script Question Classification
Abstract
As society has grown, language has been one of the most affected aspects of everyday life. We often converse in more than one language, and mixing a regional language with English is especially common. This practice is referred to as code mixing: linguistic constituents from different languages, such as phrases, proper nouns, and morphemes, are combined to produce code-mixed script. With the exponential growth of social media, code-mixed cross-script text is increasingly used in conversations on Facebook, WhatsApp, and Twitter. At the same time, such language has to be understood by automated question answering systems, one of the most important applications of AI. Although the trend is toward code-mixed language, existing work centers on a single language. At FIRE 2016, as part of Shared Task 1, CMCS (Code Mixed Cross Script Question Classification), we worked on the problem of classifying a code-mixed question into 9 given classes. The shared task focuses on Indian regional languages; we worked on classifying Bengali-English code-mixed cross-script questions. The training data uses only the English (Roman) script, so all Bengali text is also written in that script. We used machine learning for question classification, specifically the ensemble-based Random Forest algorithm. Because traditional NLP components may not work well on code-mixed script, we built a custom solution with our own set of features for classification.
CCS Concepts • Theory of computation → Random Forest • Computing methodologies → Natural language processing
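The abstract does not list the custom features, so the following is only a minimal sketch of what script-agnostic features for Romanized code-mixed questions might look like. The cue-word list, the choice of character trigrams, the function names, and the sample question are all illustrative assumptions, not the authors' feature set.

```python
from collections import Counter

# Hypothetical Romanized Bengali and English question cues (assumption, not from the paper).
CUE_WORDS = {"ki", "ke", "kothay", "kobe", "koto", "what", "who", "where", "when", "how"}


def char_ngrams(token, n=3):
    """Return the character n-grams of a token padded with '#' boundary markers."""
    padded = f"#{token}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def extract_features(question, n=3):
    """Map a code-mixed question to a sparse dict of feature counts."""
    tokens = question.lower().replace("?", " ").split()
    features = Counter()
    for token in tokens:
        # Character n-grams tolerate the spelling variation of Romanized text.
        for gram in char_ngrams(token, n):
            features[f"ng={gram}"] += 1
        # Flag tokens that match a known question cue in either language.
        if token in CUE_WORDS:
            features[f"cue={token}"] += 1
    features["n_tokens"] = len(tokens)
    return dict(features)


# Made-up sample question, roughly "How far is Delhi from Kolkata?"
example = extract_features("Kolkata theke Delhi koto dur?")
```

Feature dicts of this kind could then be vectorized (for instance with scikit-learn's `DictVectorizer`) and passed to a Random Forest classifier such as `RandomForestClassifier`; that pairing is one plausible setup, not a description of the submitted system.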
Similar resources
Modeling Classifier for Code Mixed Cross Script Questions
With the boom in the internet, social media text has been increasing day by day, and user-generated content (such as tweets and blogs) in Indian languages is written using Roman script for various socio-cultural and technological reasons. A majority of these posts are multilingual in nature and many involve code mixing where lexical items and grammatical features from two languages app...
Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification
With the increasing popularity of social media, people post updates that aid other users in finding answers to their questions. Most of the user-generated data on social media is in code-mixed or multi-script form, where the words are represented phonetically in a non-native script. We address the problem of question classification on social media data. We propose an ensemble classifier based ap...
NLP-NITMZ @ MSIR 2016 System for Code-Mixed Cross-Script Question Classification
This paper describes our approach to the Code-Mixed Cross-Script Question Classification task, which is subtask 1 of MSIR 2016. MSIR is a Mixed Script Information Retrieval event held in conjunction with FIRE 2016, the 8th meeting of the Forum for Information Retrieval Evaluation. For this task, our team NLP-NITMZ submitted three system runs: i) using a direct feature set; ii) using dire...
The First Cross-Script Code-Mixed Question Answering Corpus
In this paper, we formally introduce the problem of cross-script code-mixed question answering (QA) and elaborate on the corpus acquisition process and an evaluation strategy related to this problem. Today, social media platforms are flooded with millions of posts every day on various topics. This paper emphasizes the use of such ever-growing user-generated content to serve as information collec...
Amrita-CEN@MSIR-FIRE2016: Code-Mixed Question Classification using BoWs and RNN Embeddings
Question classification is a key task in many question answering applications. Nearly all previous work on question classification has used machine learning and knowledge-based methods. This working note presents an embedding-based Bag-of-Words method and a Recurrent Neural Network for automatic question classification in code-mixed Bengali-English text. We build two systems that clas...